Detecting Depression States Based on Sensor Data

Data Science Pipeline Tutorial

By Anastasia Kortjohn



Introduction

Depression is a mood disorder that affects more than 264 million people globally, making it one of the leading causes of disability worldwide. It's characterized by symptoms such as profound sadness, the feeling of emptiness, anxiety, sleep disturbance, as well as a general loss of initiative and interest in activities. The severity of a depression is determined by the quantity of symptoms, their seriousness and duration, as well as the consequences on social and occupational function.

Depression can be classified as unipolar or bipolar: unipolar depression refers to major depressive disorder, while bipolar depression is a facet of bipolar disorder. Both are mood disorders with a genetic component and they share symptoms, but a distinction should be made between the two: bipolar disorder is unique in the periodic occurrence of mania, a state associated with inflated self-esteem, impulsivity, increased activity, goal-directed actions, and reduced sleep.

Although there are known, effective treatments for mental disorders, between 76% and 85% of people in low- and middle-income countries receive no treatment for their disorder. One barrier to effective care is inaccurate assessment: in countries of all income levels, people who are depressed are often not correctly diagnosed.

How does sensor data come into play?

Actigraphs are small motion sensors (accelerometers) encased in a unit about the size of a wristwatch that can be worn continuously for days to months. It is well established that depression is characterized by altered motor activity, and actigraph recordings of motor activity are considered an objective method for observing depression. Although the relationship has not yet been studied exhaustively, there is increasing awareness in psychiatry of how activity data relates to various mental health issues such as changes in mood, personality, inability to cope with daily problems, or stress and withdrawal from friends and activities.

In the following tutorial, we will walk through the Data Science Pipeline to see if depression states can be accurately predicted through the sensor data recorded by Actigraphs.



The Dataset

We'll be looking at Actigraphic data originally collected for a study on motor activity in schizophrenia and major depression. Actigraphs continuously record an activity count proportional to the intensity of movement in one-minute intervals. The dataset consists of actigraphy data collected for the condition group (23 unipolar and bipolar depressed patients) as well as the control group (32 non-depressed contributors). We'll be using the Montgomery-Asberg Depression Rating Scale (MADRS) score included in the data for each participant to identify the severity of an ongoing depression. The score is based on ten items relevant for depression, which clinicians rate based on observation and conversation with the patient. The sum score (0-60) represents the severity: scores below 10 are classified as an absence of depressive symptoms, and scores above 30 indicate a severe depressive state.
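To make the two thresholds concrete, here is a minimal sketch of how a MADRS sum score could be mapped to the bands mentioned above. The function name `madrs_band` and the label for the middle range are hypothetical; the text only names the two outer bands, so the middle range is deliberately left unlabelled.

```python
def madrs_band(total_score):
    """Map a MADRS sum score (0-60) to the bands described in the text."""
    if not 0 <= total_score <= 60:
        raise ValueError("MADRS sum scores range from 0 to 60")
    if total_score < 10:
        return "no depressive symptoms"
    if total_score > 30:
        return "severe"
    # Only the two outer bands are named in this tutorial, so scores from
    # 10 to 30 are reported without a severity label.
    return "between the two thresholds"
```

For example, `madrs_band(35)` falls in the severe band, while `madrs_band(5)` indicates an absence of depressive symptoms.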



Getting started

First let's import the libraries we'll use throughout the tutorial.

Collecting and Curating the Data

The dataset contains:

Scores Data

To get the data for the control and condition groups from scores.csv, we use pandas to read the CSV file and store the data in a DataFrame. I have scores.csv in a folder called 'data', so I use the relative path data/scores.csv to access it.
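A minimal sketch of this step might look as follows. The helper names `load_scores` and `missing_per_column` are hypothetical; the path is the one mentioned above.

```python
import pandas as pd


def load_scores(source="data/scores.csv"):
    """Read the scores table into a DataFrame."""
    return pd.read_csv(source)


def missing_per_column(scores):
    """Count missing (NaN) entries in each column, largest first."""
    return scores.isna().sum().sort_values(ascending=False)
```

Calling `scores = load_scores()` followed by `missing_per_column(scores)` gives a quick overview of where the gaps are before we inspect the table itself.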

Managing and Representing the Scores Data

Right away we can see there is a lot of missing data, most of it the depression data for the control group. We expect this to be missing for the control group, because depression data is only collected for those in the condition group; thus it's considered missing at random (MAR). The control group is also missing data for education, marriage, and work. This data could also be considered MAR, because it looks as though only number, days, gender, and age were collected for the control group; the data is therefore missing because the participant belongs to the control group. Since we'll be focusing mainly on the Actigraph data for this group, it shouldn't be a concern for the rest of the tutorial. For the condition group, there are three missing melancholia scores and one missing education range value. It's not immediately clear which kind of missing data this is; after the next step we'll take a closer look to see if there's any correlation between the condition group's missing values.


Since we'll be looking at differences between the condition and control groups, let's split the scores data into a control group DataFrame and a condition group DataFrame. We could use the numeric indices to select a subset of the DataFrame, but when there are many more rows or the boundary between the two groups isn't obvious from the indices, it is more robust to select rows based on the column values (e.g. control or condition) as follows.
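One way to sketch the split is with a boolean mask. This assumes the participant identifier lives in the `number` column and that control identifiers start with the text "control"; both are assumptions about the file layout, and `split_groups` is a hypothetical helper name.

```python
import pandas as pd


def split_groups(scores, id_column="number"):
    """Split the scores table into (control, condition) DataFrames.

    Assumes identifiers in `id_column` start with "control" for the
    control group; every other row is treated as the condition group.
    """
    is_control = scores[id_column].str.startswith("control")
    control = scores[is_control].reset_index(drop=True)
    condition = scores[~is_control].reset_index(drop=True)
    return control, condition
```

Because the mask is computed from the values themselves, the split keeps working if rows are reordered or more participants are added.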

We could drop all the NaN columns for the control group, but we'll keep it as is for now so it corresponds to the condition group DataFrame.

As mentioned previously, the condition group has missing data. It doesn't look as though it depends on any data in other columns, so it may be missing completely at random (MCAR). We can use different plots/figures to display any correlation between missing values. With only two columns containing missing data, the following missingno heatmap is not too informative, but it could be useful for larger datasets with more missing data.
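As a rough sketch, the heatmap shows how strongly the presence or absence of one column correlates with that of another, which can also be computed directly with pandas. The helper names below are hypothetical. `msno.heatmap` is the missingno call, imported inside the function so the pandas-only helper works even where missingno is not installed.

```python
import pandas as pd


def nullity_correlation(df):
    """Correlate the missing/present pattern of partly missing columns."""
    missing = df.isna()
    # Columns that are fully present or fully empty have no variance in
    # their missingness, so they carry no correlation information.
    partial = missing.loc[:, missing.any() & ~missing.all()]
    return partial.astype(int).corr()


def plot_missing_heatmap(df):
    """Draw the missingno nullity-correlation heatmap for df."""
    import missingno as msno  # optional dependency, only needed for plotting
    return msno.heatmap(df)
```

A value near 1 means two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing, and values near 0 suggest no relationship.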

Since the goal in identifying the type of missing data is to determine if and how it might affect future analyses, let's use seaborn to plot the scores we'll be using in the analysis (MADRS 1 and 2) on the x- and y-axes, then see if the missing melancholia values correlate in any way. We can use color and size to differentiate the NA values, as well as style to mark the education range values.
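A sketch of such a plot is shown below. The column names `madrs1`, `madrs2`, `melanch`, and `edu` are assumptions about how scores.csv labels these fields, and the helper names and marker sizes are hypothetical choices.

```python
import pandas as pd
import seaborn as sns


def label_missing(scores, column="melanch", edu_column="edu"):
    """Add a column flagging rows whose melancholia score is missing."""
    labelled = scores.copy()
    labelled["melanch_status"] = labelled[column].isna().map(
        {True: "missing", False: "recorded"}
    )
    # Give missing education ranges an explicit label so those rows
    # still appear in the plot with their own marker style.
    labelled[edu_column] = labelled[edu_column].fillna("unknown")
    return labelled


def plot_madrs_scatter(labelled, edu_column="edu"):
    """Scatter MADRS 1 against MADRS 2, highlighting missing values."""
    return sns.scatterplot(
        data=labelled,
        x="madrs1",
        y="madrs2",
        hue="melanch_status",   # color separates missing from recorded
        size="melanch_status",  # missing points are drawn larger
        sizes={"recorded": 60, "missing": 200},
        style=edu_column,       # marker shape encodes the education range
    )
```

With the condition group DataFrame in hand, `plot_madrs_scatter(label_missing(condition_scores))` would draw the figure, where `condition_scores` stands for whatever name that DataFrame was given.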

Other than 2 of the 3 NA points overlapping other plot points, the figure doesn't show a particular correlation, e.g. that all the missing melancholia scores have the same MADRS 1 or 2 scores, or that all three have the same education range value. As a result, we can move forward with our analysis and treat the missing condition group data as MCAR.

Actigraph Data

Now that we've organized the scores data, we need to get all the Actigraph data. After getting the control group data, we will repeat the process for the condition group. Originally this step involved constructing two DataFrames (one for the control group and one for the condition group), but since the dates and amounts of measured Actigraph data don't necessarily align among the participants, there is no benefit to storing all the Actigraph data together in one DataFrame. Instead, we'll make a control list and a condition list, and fill them with DataFrames for each CSV file (each participant).
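A minimal sketch of building one such list is shown below. It assumes each group's files sit in their own folder and that the time column is called `timestamp`; the folder names in the usage note and the helper name `load_group` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_group(folder, timestamp_column="timestamp"):
    """Read every CSV file in folder into its own DataFrame.

    Returns a list with one DataFrame per participant file, with the
    timestamp column parsed into datetime values while reading.
    """
    frames = []
    # Sort the paths so the order of the list is reproducible.
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frames.append(pd.read_csv(csv_path, parse_dates=[timestamp_column]))
    return frames
```

We would then call it once per group, for example `control = load_group("data/control")` and `condition = load_group("data/condition")`. Note that plain sorting orders file names alphabetically, so a file ending in 10 comes before one ending in 2.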

Managing and Representing the Data

As the comments in the code describe, we parse the timestamps while reading the csv so they'll be formatted as DateTime objects, which may come in handy later on in the tutorial. Additionally, the first three elements of each list are displayed above, showing the partial DataFrames of Actigraph data and their row count (number of Actigraph recordings). Let's use matplotlib to get an idea of the data we're working with.
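One way to sketch this overview plot is a two-by-two grid with one participant per panel. The column names `timestamp` and `activity` are assumptions, and `plot_activity_samples` is a hypothetical helper.

```python
import matplotlib.pyplot as plt


def plot_activity_samples(samples):
    """Plot activity over time for up to four participants.

    samples maps a panel title to a DataFrame with 'timestamp' and
    'activity' columns. Each panel keeps its own y-axis scale.
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 8))
    for ax, (title, frame) in zip(axes.flat, samples.items()):
        ax.plot(frame["timestamp"], frame["activity"], linewidth=0.5)
        ax.set_title(title)
        ax.set_xlabel("Time")
        ax.set_ylabel("Activity (intensity)")
    fig.autofmt_xdate()  # tilt the date labels so they don't overlap
    fig.tight_layout()
    return fig, axes
```

For example, passing `{"Control 1": control[0], "Control 2": control[1], "Condition 2": condition[1], "Condition 3": condition[2]}` and then calling `plt.show()` would display two participants from each group side by side.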

We can see from these four samples that the Actigraph data is spread over one or two months, and there is a lot of variation in activity intensity over time. The subplots have their own y-axis tick values, so we need to look carefully at the range if we want to compare the four samples. From this initial glance at the data, we can see the upper range of the control group's activity reaches 2,000, whereas it's less consistent for the condition group. Condition 3 has a majority of the activity below 500, but there are scattered points up to 1,000. Condition 2 has a majority of activity below 1,000, but there are spikes going up to 3,000-3,500. Overall we've learned there are certain days or weeks during the couple of months recorded where the activity spikes, and the intensity value is usually somewhere in the 0-5,000 range.

Exploratory Data Analysis and Visualization

For the first step in the exploratory data analysis, let's see if we can better understand the range and other statistics about each group using boxplots.
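The boxplot code that follows uses two DataFrames, control_activity and condition_activity, whose construction isn't shown. One plausible way to build them, offered purely as an assumption, is to place each participant's activity readings in their own column so that every participant gets one box. The helper name and column labels below are hypothetical.

```python
import pandas as pd


def activity_table(frames, prefix, activity_column="activity"):
    """Combine per-participant DataFrames into one column per participant.

    Recordings differ in length, so shorter columns are padded with NaN,
    which the boxplot statistics simply ignore.
    """
    columns = {
        f"{prefix}_{i}": frame[activity_column].reset_index(drop=True)
        for i, frame in enumerate(frames, start=1)
    }
    return pd.DataFrame(columns)
```

Under this assumption, `control_activity = activity_table(control, "control")` and `condition_activity = activity_table(condition, "condition")` would produce tables in the shape a DataFrame boxplot expects.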

fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Control group on the left axes, drawn in red
control_activity_plot = control_activity.boxplot(
    figsize=(24, 16), grid=False,
    boxprops=dict(color="r"),
    flierprops=dict(marker='.', markeredgecolor='darkred', markersize=1, alpha=0.5),
    whiskerprops=dict(color='r'),
    medianprops=dict(color='purple'),
    ax=axes[0])
control_activity_plot.set_title('Control Group Activity')
control_activity_plot.set_xlabel('Group')
control_activity_plot.set_ylabel('Activity (intensity)')

# Condition group on the right axes, with blue outlier markers
condition_activity_plot = condition_activity.boxplot(
    figsize=(24, 16), grid=False,
    flierprops=dict(marker='.', markeredgecolor='b', markersize=2, alpha=0.7),
    ax=axes[1])
condition_activity_plot.set_title('Condition Group Activity')
condition_activity_plot.set_xlabel('Group')
condition_activity_plot.set_ylabel('Activity (intensity)')

fig.tight_layout()
plt.show()

Hypothesis Testing and Machine Learning

Conclusion